Incremental Non-Greedy Clustering at Scale
Clustering is the task of organizing data into meaningful groups. Modern clustering applications such as entity resolution place several demands on clustering algorithms: (1) scalability to massive numbers of points as well as clusters, (2) incremental additions of data, and (3) support for arbitrary user-specified similarity functions.
Hierarchical clusterings are often desired because they represent multiple alternative flat clusterings (e.g., at different granularity levels). These tree-structured clusterings provide both fine-grained clusters and a representation of uncertainty in the presence of newly arriving data. Previous work on hierarchical clustering does not fully address all three of the aforementioned desiderata. Work on incremental hierarchical clustering often makes greedy, irrevocable clustering decisions that are regretted in the presence of future data. Work on scalable hierarchical clustering does not support incremental additions or deletions. These methods often impose restrictions on the similarity functions used and/or empirically tend to over-merge clusters, which can lead to inaccurate clusterings.
In this thesis, we present incremental and scalable methods for hierarchical clustering that empirically satisfy the above desiderata. Our work aims to represent uncertainty and meaningful alternative clusterings, to efficiently reconsider past decisions in the incremental case, and to use parallelism to scale to massive datasets. Our method, Grinch, handles incrementally arriving data in a non-greedy fashion by reconsidering past decisions using tree-structure re-arrangements (e.g., rotations and grafts) invoked in accordance with the user's specified similarity function. To achieve scalability to massive datasets, our method, SCC, builds a hierarchical clustering in a level-wise, bottom-up manner. Certain clustering decisions are made independently in parallel within each level, and a global similarity-threshold schedule prevents greedy over-merging. We show how SCC can be combined with the tree-structure re-arrangements in Grinch to form a mini-batch algorithm that is both scalable and incremental. Lastly, we generalize our hierarchical clustering approaches to DAG-structured ones, which can better represent uncertainty in clustering by allowing overlapping clusters. We introduce an efficient bottom-up method for DAG-structured clustering, Llama. For each of the proposed methods, we provide both theoretical and empirical analysis. Empirically, our methods achieve state-of-the-art results on clustering benchmarks in both the batch and incremental settings, including multiple-point improvements in dendrogram purity and scalability to billions of points.
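To make the level-wise scheme concrete, here is a minimal Python sketch of threshold-scheduled bottom-up agglomeration in the spirit of SCC. It is not the authors' implementation: the real SCC parallelizes decisions within each level and avoids all-pairs comparisons, while this sketch simply merges connected components of a similarity graph under a descending threshold schedule; the cluster-level `sim` function and the `thresholds` schedule are assumed inputs.

    import itertools

    def scc_like_cluster(points, sim, thresholds):
        """One level per threshold: merge connected components of the graph
        {(i, j) : sim(C_i, C_j) >= tau}. `sim` scores two clusters (lists)."""
        clusters = [[p] for p in points]              # leaves are singletons
        hierarchy = [clusters]
        for tau in sorted(thresholds, reverse=True):  # strict -> permissive
            parent = list(range(len(clusters)))       # union-find forest
            def find(i):
                while parent[i] != i:
                    parent[i] = parent[parent[i]]     # path halving
                    i = parent[i]
                return i
            for i, j in itertools.combinations(range(len(clusters)), 2):
                if sim(clusters[i], clusters[j]) >= tau:
                    parent[find(i)] = find(j)
            merged = {}
            for i, c in enumerate(clusters):
                merged.setdefault(find(i), []).extend(c)
            clusters = list(merged.values())
            hierarchy.append(clusters)
        return hierarchy                              # bottom-up levels

Starting from the strictest threshold mirrors the role of the global schedule in the abstract: early levels can only make high-confidence merges, which is what prevents the greedy over-merging attributed to prior methods.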
Efficient k-NN Search with Cross-Encoders using Adaptive Multi-Round CUR Decomposition
Cross-encoder models, which jointly encode and score a query-item pair, are prohibitively expensive for direct k-nearest neighbor (k-NN) search. Consequently, k-NN search typically employs a fast approximate retrieval stage (e.g., using BM25 or dual-encoder vectors), followed by reranking with a cross-encoder; however, the retrieval approximation often incurs detrimental recall regret. This problem is tackled by ANNCUR (Yadav et al., 2022), a recent method that employs the cross-encoder alone, making search efficient via a relatively small number of anchor items and a CUR matrix factorization. While ANNCUR's one-time selection of anchors tends to approximate cross-encoder distances on average, it forfeits the capacity to accurately estimate distances to items near the query, leading to regret on the crucial end task: recall of the top-k items. In this paper, we propose ADACUR, a method that adaptively, iteratively, and efficiently minimizes the approximation error for the practically important top-k neighbors. It does so by iteratively performing k-NN search using the anchors available so far, then adding the retrieved nearest neighbors to the anchor set for the next round. Empirically, on multiple datasets, compared to traditional methods and the state of the art such as ANNCUR and dual-encoder-based retrieve-and-rerank, our proposed approach ADACUR consistently reduces recall error (by up to 70% in the important k = 1 setting) while using no more compute than its competitors.
Comment: Findings of EMNLP 202
Entity Linking and Discovery via Arborescence-based Supervised Clustering
Previous work has shown promising results in performing entity linking by measuring not only the affinities between mentions and entities but also those among mentions. In this paper, we present novel training and inference procedures that fully utilize mention-to-mention affinities by building minimum arborescences (i.e., directed spanning trees) over mentions and entities across documents in order to make linking decisions. We also show that this method gracefully extends to entity discovery, enabling the clustering of mentions that do not have an associated entity in the knowledge base. We evaluate our approach on the Zero-Shot Entity Linking dataset and on MedMentions, the largest publicly available biomedical dataset, and show significant improvements in performance for both entity linking and discovery compared to identically parameterized models. We further show significant efficiency improvements, with only a small loss in accuracy, over previous work that uses more computationally expensive models.
Comment: Updated reference
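A rough sketch of how linking decisions could be read off an arborescence, using networkx. This is illustrative rather than the paper's training or inference procedure: the graph construction, edge weights, and `affinity` function are assumptions, and we compute a maximum spanning arborescence over affinities, which is equivalent to a minimum arborescence over negated affinities.

    import networkx as nx

    def link_by_arborescence(mentions, entities, affinity):
        """Link each mention to the entity at the root of its subtree in a
        spanning arborescence over a root, all entities, and all mentions."""
        G = nx.DiGraph()
        for e in entities:
            G.add_edge("ROOT", e, weight=0.0)     # entities hang off a root
        for m in mentions:
            for e in entities:
                G.add_edge(e, m, weight=affinity(e, m))
            for m2 in mentions:
                if m2 != m:                       # mention-mention edges let
                    G.add_edge(m, m2, weight=affinity(m, m2))
        # every non-root node keeps exactly one incoming edge
        arbo = nx.maximum_spanning_arborescence(G)
        ent_set = set(entities)
        def entity_of(node):                      # walk up until an entity
            (parent,) = arbo.predecessors(node)
            return parent if parent in ent_set else entity_of(parent)
        return {m: entity_of(m) for m in mentions}

Because mention-to-mention edges compete with entity-to-mention edges for each mention's single incoming slot, a mention can be linked indirectly through a coreferent mention, which is what lets the same structure support discovery of mentions with no matching entity.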
Improving Dual-Encoder Training through Dynamic Indexes for Negative Mining
Dual-encoder models are ubiquitous in modern classification and retrieval. Crucial for training such dual encoders is an accurate estimation of gradients from the partition function of the softmax over the large output space; this requires finding the negative targets that contribute most significantly ("hard negatives"). Since dual-encoder model parameters change during training, the use of traditional static nearest-neighbor indexes can be sub-optimal: such static indexes (1) periodically require expensive re-building of the index, which in turn requires (2) expensive re-encoding of all targets using updated model parameters. This paper addresses both of these challenges. First, we introduce an algorithm that uses a tree structure to approximate the softmax with provable bounds and that maintains the tree dynamically. Second, we approximate the effect of a gradient update on target encodings with an efficient Nyström low-rank approximation. In our empirical study on datasets with over twenty million targets, our approach cuts error in half relative to oracle brute-force negative mining. Furthermore, our method surpasses the prior state of the art while using 150x less accelerator memory.
Comment: To appear at AISTATS 202
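One way to picture the Nyström step is the following speculative Python sketch, which is not the paper's construction: after a parameter update, only a small landmark set of targets is re-encoded exactly, and the shift of every other cached encoding is extrapolated through a kernel over the old encodings. The RBF kernel, the regularizer, and all names are assumptions made for illustration.

    import numpy as np

    def nystrom_shift(E_old, encode_new, landmarks, gamma=1.0):
        """E_old: (n, d) cached target encodings; encode_new(idx) re-encodes
        the given targets under the updated model; landmarks: m indices."""
        delta_L = encode_new(landmarks) - E_old[landmarks]   # (m, d), exact
        # RBF kernel between all cached encodings and the landmark encodings
        # (builds an (n, m, d) intermediate; acceptable for a sketch)
        d2 = ((E_old[:, None, :] - E_old[None, landmarks, :]) ** 2).sum(-1)
        K_nL = np.exp(-gamma * d2)                           # (n, m)
        K_LL = K_nL[landmarks]                               # (m, m)
        coef = np.linalg.solve(K_LL + 1e-6 * np.eye(len(landmarks)), delta_L)
        return E_old + K_nL @ coef          # approximate updated encodings

The appeal of an approximation of this shape is that the cost of tracking a model update scales with the number of landmarks rather than with the full target set, which is what makes refreshing an index over tens of millions of targets tractable between re-builds.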